Abstract

In leading morpho-phonological theories and state-of-the-art
text-to-speech systems it is assumed that word pronunciation cannot
be learned or performed without in-between analyses at several
abstraction levels (e.g., morphological, graphemic, phonemic,
syllabic, and stress levels). We challenge this assumption for the
case of English word pronunciation. Using IGTREE, an
inductive-learning decision-tree algorithm, we train and test three
word-pronunciation systems in which the number of abstraction levels
(implemented as sequenced modules) is reduced from five, via three,
to one. The latter system, classifying letter strings directly as
mapping to phonemes with stress markers, yields significantly better
generalisation accuracies than the two multi-module systems. Analyses
of empirical results indicate that positive utility effects of
sequencing modules are outweighed by cascading errors passed on
between modules.

1 Introduction

Learning word pronunciation can be a hard task when the relation
between the spelling of a language and its corresponding
pronunciation is many-to-many. The English writing system and its
pronunciation are a notoriously complex example; the complexity
stems from an apparent conflict between analogy and inconsistency:

Analogy. When two words or word chunks have a similar spelling, they
tend to have a similar pronunciation. This tendency (which
generalises to other language tasks as well) is usually referred to
as the analogy principle (De Saussure, 1916; Yvon, 1996; Daelemans,
1996).

Inconsistency. Much of the analogy in English word pronunciation is
disrupted by productive and complex word morphology, word stress, and
graphematics.

Influential pre-Chomskyan linguistic theories pointed to the analogy
principle as the underlying principle for language learning (De
Saussure, 1916), and to induction as the reasoning method for
generalising from learned instances of language tasks to new
instances through analogy (Bloomfield, 1933). However, methods and
resources (e.g., computer technology) were not available then to
demonstrate how induction through analogy could be employed to learn
and model language tasks. Partly due to this lack of demonstrating
power, Chomsky later stated:

"... I don't see any way of explaining the resulting final state [of
language learning] in terms of any proposed general developmental
mechanism that has been suggested by artificial intelligence,
sensorimotor mechanisms, or anything else" (Chomsky, in
Piatelli-Palmarini, 1980, p. 100).



Chomsky's argument is based on the assumption that generic learning
methods such as induction cannot autonomously discover essential
levels of abstraction in language processing tasks. Applied to
morpho-phonology, the argument states that generic learning methods
are not able to discover morphology, graphematics, and stress
patterns autonomously when learning word pronunciation, although this
knowledge appears essential. Phonological and morphological theories,
influenced by Chomskyan theory across the board since the publication
of SPE (Chomsky and Halle, 1968), have generally adopted the idea of
abstraction levels in various guises (e.g., levels, tapes, tiers,
grids) (Goldsmith, 1976; Liberman and Prince, 1977; Koskenniemi,
1984; Mohanan, 1986). Although there is no general consensus on which
levels of abstraction can be discerned in phonology and morphology,
there is a rough, global agreement on the fact that words can be
represented on different abstraction levels as strings of letters,
graphemes, morphemes, phonemes, syllables, and stress patterns.

According to these leading morpho-phonological 
theories, systems that (learn to) convert spelled 
words to phonemic words in one pass, i.e., without 
making use of abstraction levels, are assumed to be 
unable to generalise to new cases: going through 
the relevant abstraction levels is deemed essential to 
yield correct conversions of previously unseen words. 
This assumption implies that if one wants to build 
a system that converts text to speech, one should 
implement explicitly the relevant levels of abstrac- 
tion. Such explicit implementations of abstraction 
levels can indeed be witnessed in many state-of-the-art speech
synthesisers, implemented as (sequential) modules (Allen, Hunnicutt,
and Klatt, 1987; Daelemans, 1



In this paper we challenge the assumption that levels of abstraction
must be made explicit in learning and performing the
word-pronunciation task. We do this by applying an inductive-learning
algorithm from machine learning to word pronunciation. From a wealth
of existing algorithms in machine learning (Mitchell, 1997), we
choose IGTREE (Daelemans, Van den Bosch, and Weijters, 1997), an
inductive decision-tree learning algorithm. IGTREE is a fast
algorithm which has been demonstrated to be applicable to language
tasks (Van den Bosch and Daelemans, 1993; Van den Bosch, Daelemans,
and Weijters, 1996; Daelemans, Van den Bosch, and Weijters, 1997). We
construct IGTREE decision trees for word pronunciation, and perform
empirical tests to estimate the trees' generalisation accuracy, i.e.,
their ability to process new, unseen word-pronunciation instances
correctly.

Rather than constructing and testing a single sys- 
tem, our approach is to test different modulari- 
sations of the word-pronunciation task systemati- 
cally, to allow for an empirical comparison of word- 
pronunciation systems with and without the explicit 
learning of abstraction levels. First, we train (by 
inductive learning) and test a word-pronunciation 
model reflecting linguistic assumptions on abstrac- 
tion levels quite closely: the model is composed of 
five sequentially-coupled modules. Second, we train 
and test a model in which the number of modules 



is reduced to three, integrating two pairs of levels 
of abstraction. Third, we train and test a model 
performing word pronunciation in a single pass, i.e., 
without modular decomposition. 

The paper is structured as follows: first, in Section 2 we provide a
description of IGTREE, the data on which IGTREE is trained and
tested, and the applied experimental methodology. Second, in
Section 3 we introduce the three word-pronunciation systems, and for
each system we describe the experiments performed and discuss the
results obtained. In Section 4 we compare the three systems and
analyse the consequences of modularisation. Section 5 briefly
mentions related work on inductive learning of word pronunciation.
Section 6 summarises the results obtained and lists some points of
discussion.

2 Algorithm, Data, Methodology 
2.1 Algorithm: IGTREE 



IGTREE (Daelemans, Van den Bosch, and Weijters, 1997) is a top-down
induction of decision trees (TDIDT) algorithm (Breiman et al., 1984;
Quinlan, 1993). TDIDT is a widely-used method in supervised machine
learning (Mitchell, 1997). IGTREE is designed as an optimised
approximation of the instance-based learning algorithm IB1-IG
(Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and
Weijters, 1997). In IGTREE, information gain is used as a guiding
function to compress a data base of instances of a certain task into
a decision tree.¹ Instances are stored in the tree as paths of
connected nodes ending in leaves which contain classification
information. Nodes are connected via arcs denoting feature values.
Information gain is used in IGTREE to determine the order in which
feature values are added as arcs to the tree. Information gain is a
function from information theory, and is used similarly in ID3
(Quinlan, 1986) and C4.5 (Quinlan, 1993).

The idea behind computing the information gain of features is to
interpret the training set (i.e., the set of task instances for which
all classifications are given and which are used for training the
learning algorithm) as an information source capable of generating a
number of messages (i.e., classifications) with a certain
probability. The information entropy H of such an information source
can be compared in turn, for each of the features characterising the
instances (let n equal the number of features), to the average
information entropy of the information source when the value of that
feature is known. Data-base information entropy H(D) is equal to the
number of bits of information needed to know the classification given
an instance. It is computed by equation 1, where p_i (the probability
of classification i) is estimated by its relative frequency in the
training set.

¹ IGTREE can function with any feature weighting method, such as gain
ratio (Quinlan, 1993); for all experiments reported here, information
gain was used.



$$H(D) = -\sum_{i} p_i \log_2 p_i \qquad (1)$$



To determine the information gain of each of the n features
f_1 ... f_n, we compute the average information entropy for each
feature and subtract it from the information entropy of the data
base. To compute the information entropy for a feature f_i, given in
equation 2, we take the weighted average information entropy of the
data base restricted to each possible value for the feature. The
expression D_[f_i = v_j] refers to those patterns in the data base
that have value v_j for feature f_i, j indexes the possible values
of f_i, and V is the set of possible values for feature f_i.
Finally, |D| is the number of patterns in the (sub) data base.



$$H(D_{[f_i]}) = \sum_{v_j \in V} H(D_{[f_i = v_j]}) \, \frac{|D_{[f_i = v_j]}|}{|D|} \qquad (2)$$



Information gain of feature f_i is then obtained by equation 3.

$$G(f_i) = H(D) - H(D_{[f_i]}) \qquad (3)$$
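The three equations can be sketched in a few lines of Python. This is a minimal illustration on hypothetical toy data, not the authors' implementation; the function names `entropy` and `information_gain` and the two-feature instances are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Equation 1: H(D), with p_i estimated by relative frequency."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    """Equations 2 and 3: H(D) minus the weighted average entropy of the
    sub data bases obtained by fixing the value of one feature."""
    by_value = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_value[instance[feature]].append(label)
    weighted = sum(entropy(sub) * len(sub) / len(labels)
                   for sub in by_value.values())
    return entropy(labels) - weighted


# Hypothetical toy data: feature 0 determines the class, feature 1 does not.
instances = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["1", "1", "0", "0"]
gains = [information_gain(instances, labels, f) for f in range(2)]
print(gains)
```

In this toy case the first feature removes all uncertainty about the class and the second removes none, so the first would be placed higher in the feature ordering.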



In IGTREE, feature-value information is stored in the decision tree
on arcs. The first feature values, stored as arcs connected to the
tree's top node, are those representing the values of the feature
with the highest information gain, followed at the second level of
the tree by the values of the feature with the second-highest
information gain, etc., until the classification information
represented by a path is unambiguous. Knowing the value of the most
important feature may already uniquely identify a classification, in
which case the other feature values of that instance need not be
stored in the tree. Alternatively, it may be necessary for
disambiguation to store a long path in the tree.

Apart from storing uniquely identified class labels at leaves, IGTREE
stores at each non-terminal node information on the most probable
classification given the path so far. The most probable
classification is the most frequently occurring classification in the
subset of instances being compressed in the path being expanded.
Storing the most probable class at non-terminal nodes is essential
when processing new instances. Processing a new instance involves
traversing the tree by matching the feature values of the test
instance with arcs in the tree, in the order of the feature
information gain. Traversal ends when (i) a leaf is reached or when
(ii) matching a feature value with an arc fails. In case (i), the
classification stored at the leaf is taken as output. In case (ii),
we use the most probable classification stored at the most recently
visited non-terminal node instead.
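The construction and traversal just described can be sketched as follows. This is a simplified reading of the description above, not the published IGTREE code: the nested-dictionary node layout, the names `build` and `classify`, and the toy three-letter windows are all assumptions, and any further compression the original implementation may apply is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    groups = defaultdict(list)
    for instance, label in zip(instances, labels):
        groups[instance[feature]].append(label)
    weighted = sum(entropy(g) * len(g) / len(labels) for g in groups.values())
    return entropy(labels) - weighted


def build(instances, labels, order):
    """Store instances as paths; every node keeps its most probable class."""
    node = {"default": Counter(labels).most_common(1)[0][0], "arcs": {}}
    if len(set(labels)) == 1 or not order:
        return node  # unambiguous (or out of features): this is a leaf
    feature, rest = order[0], order[1:]
    groups = defaultdict(lambda: ([], []))
    for instance, label in zip(instances, labels):
        groups[instance[feature]][0].append(instance)
        groups[instance[feature]][1].append(label)
    for value, (sub_instances, sub_labels) in groups.items():
        node["arcs"][value] = build(sub_instances, sub_labels, rest)
    return node


def classify(tree, instance, order):
    """Follow matching arcs; stop at a leaf or at the first failed match."""
    node = tree
    for feature in order:
        child = node["arcs"].get(instance[feature])
        if child is None:
            break
        node = child
    return node["default"]


# Hypothetical toy windows: (left letter, focus letter, right letter).
instances = [("_", "b", "o"), ("a", "b", "a"), ("b", "o", "o"),
             ("o", "o", "k"), ("o", "k", "_"), ("a", "k", "o")]
labels = ["b", "b", "u", "-", "k", "k"]
gains = [information_gain(instances, labels, f) for f in range(3)]
order = sorted(range(3), key=lambda f: -gains[f])  # highest gain first
tree = build(instances, labels, order)
print(order, classify(tree, ("z", "o", "z"), order))
```

In this toy data the focus letter has the highest information gain, so it labels the arcs below the top node. A window whose focus letter is known but whose context is new falls back on the most probable class stored at the last node it matched.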

2.2 Data Acquisition and Preprocessing 

The resource of word-pronunciation instances used in our experiments
is the CELEX lexical data base of English (Burnage, 1990). All items
in the CELEX data bases contain hyphenated spelling, syllabified and
stressed phonemic transcriptions, and detailed morphological
analyses. We extracted from the English data base of CELEX all the
above information, resulting in a data base containing 77,565 unique
items (word forms with syllabified, stressed pronunciations and
morphological segmentations).

For use in experiments with learning algorithms, the data is
preprocessed to derive fixed-size instances. In the experiments
reported in this paper different morpho-phonological (sub)tasks are
investigated; for each (sub)task, an instance base (training set) is
constructed containing instances produced by windowing (Sejnowski and
Rosenberg, 1987) and attaching to each instance the classification
appropriate for the (sub)task under investigation. Table 1 displays
example instances derived from the sample word booking. With this
method, for each (sub)task an instance base of 675,745 instances is
built.

In the table, six classification fields are shown, one of which is a
composite field; each field refers to one of the (sub)tasks
investigated here. M stands for morphological decomposition:
determine whether a letter is the initial letter of a morpheme (class
'1') or not (class '0'). A is graphemic parsing²: determine whether a
letter is the first or only letter of a grapheme (class '1') or not
(class '0'); a grapheme is a cluster of one or more letters mapping
to a single phoneme. G is grapheme-phoneme conversion: determine the
phonemic mapping of the middle letter. Y is syllabification:
determine whether the middle phoneme is syllable-initial. S is stress
assignment: determine the stress level of the middle phoneme.
Finally, GS is integrated grapheme-phoneme conversion and stress
assignment. The example instances in Table 1 show that each (sub)task
is phrased as a



² Graphemic parsing is not represented in the CELEX data. We used an
automatic alignment algorithm (Daelemans and Van den Bosch, 1997) to
determine which letters are the first or only letters of a grapheme.







          letter-window instances                        phoneme-window instances
instance  left     focus  right    classifications       left         focus  right         classif.
number    context         context  M  A  G    S  GS      context             context       Y  S
1         _ _ _    b      o o k    1  1  /b/  1  /b/1    _ _ _        /b/    /u/ /-/ /k/   1  1
2         _ _ b    o      o k i    0  1  /u/  0  /u/0    _ _ /b/      /u/    /-/ /k/ /ɪ/   0  0
3         _ b o    o      k i n    0  0  /-/  0  /-/0    _ /b/ /u/    /-/    /k/ /ɪ/ /ŋ/   0  0
4         b o o    k      i n g    0  1  /k/  0  /k/0    /b/ /u/ /-/  /k/    /ɪ/ /ŋ/ /-/   1  0
5         o o k    i      n g _    1  1  /ɪ/  0  /ɪ/0    /u/ /-/ /k/  /ɪ/    /ŋ/ /-/ _     0  0
6         o k i    n      g _ _    0  1  /ŋ/  0  /ŋ/0    /-/ /k/ /ɪ/  /ŋ/    /-/ _ _       0  0
7         k i n    g      _ _ _    0  0  /-/  0  /-/0    /k/ /ɪ/ /ŋ/  /-/    _ _ _         0  0




Table 1: Example of instances generated from the word booking, with classifications for all of the subtasks 
investigated, viz. M, A, G, Y, S, and GS. 



classification task on the basis of windows of letters or phonemes
(the stress assignment task S is investigated with both letters and
phonemes as input). Each window represents a snapshot of a part of a
word or phonemic transcription, and is labelled by the classification
associated with the middle letter of the window. For example, the
first letter-window instance, _ _ _ b o o k, is linked with label '1'
for the morphological segmentation task (M), since the middle letter
b is the first letter of the morpheme book; the other instance
labelled with morphological-segmentation class '1' is the instance
with i in the middle, since i is the first letter of the
(inflectional) morpheme ing. Classifications may either be binary
('1' or '0') for the segmentation tasks (M, A, and Y), or have more
values, such as 62 possible phonemes (G) or three stress markers
(primary, secondary, or no stress, S), or a combination of these
classes (159 combined phonemes and stress markers, GS).
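The windowing step can be sketched as below. This is an illustrative sketch only: the function name `windows`, the padding symbol `_`, and the context width of three positions on each side (chosen to give the seven features used in the experiments) are assumptions made for the example.

```python
def windows(symbols, classes, context=3, pad="_"):
    """Return one fixed-size instance per position: a window of
    2 * context + 1 symbols centred on the focus symbol, paired with
    the class of that focus symbol."""
    padded = [pad] * context + list(symbols) + [pad] * context
    size = 2 * context + 1
    return [(tuple(padded[i:i + size]), cls) for i, cls in enumerate(classes)]


# The sample word booking with its morphological segmentation classes (M):
# '1' marks the first letter of a morpheme (book + ing).
instances = windows("booking", ["1", "0", "0", "0", "1", "0", "0"])
for window, cls in instances:
    print(" ".join(window), cls)
```

Each word of k letters yields k instances, which is how an instance base with several hundred thousand instances arises from a lexicon with tens of thousands of words.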

2.3 Methodology 

Our empirical study focuses on measuring the ability of the IGTREE
learning algorithm to use the knowledge accumulated during learning
for the classification of new, unseen instances of the same
(sub)task, i.e., we measure its generalisation accuracy. Weiss and
Kulikowski (1991) describe n-fold cross validation (n-fold CV) as a
procedure for measuring generalisation accuracy. For our experiments
with IGTREE, we set up 10-fold CV experiments consisting of five
steps. (i) On the basis of a data set, n partitionings of the data
set are generated, each into one training set containing ((n-1)/n)th
of the data set and one test set containing (1/n)th of the data set.
For each partitioning, the following three steps are repeated: (ii)
Information-gain values for all (seven) features are computed on the
basis of the training set (cf. Subsection 2.1). (iii) IGTREE is
applied to the training set, yielding an induced decision tree (cf.
Subsection 2.1). (iv) The tree is tested by letting it classify all
instances in the test set, which results in a percentage of
incorrectly classified test instances. (v) When each of the n folds
has produced an error percentage on test material, a mean
generalisation error of the learned model is computed. Weiss and
Kulikowski (1991) argue that by using n-fold CV, preferably with
n ≥ 10, one can retrieve a good estimate of the true generalisation
error of a learning algorithm given an instance base. Mean results
can be employed further in significance tests. In our experiments,
n = 10, and one-tailed t-tests are performed.
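Steps (i) to (v) can be sketched as a small loop. The sketch below is hedged in several ways: the interleaved partitioning, the helper names, and the majority-class stand-in learner are assumptions for illustration; in the experiments described here the learner would be IGTREE, and the way the data were actually partitioned is not specified in this section.

```python
from collections import Counter


def partitions(data, n=10):
    """Step (i): n partitionings, each with (n-1)/n of the data for
    training and 1/n for testing (here: a simple interleaved split)."""
    folds = [data[i::n] for i in range(n)]
    for k in range(n):
        train = [item for j, fold in enumerate(folds) if j != k
                 for item in fold]
        yield train, folds[k]


def mean_generalisation_error(data, train_fn, classify_fn, n=10):
    """Steps (ii)-(v): train per fold, test, and average the error %."""
    errors = []
    for train, test in partitions(data, n):
        model = train_fn(train)
        wrong = sum(1 for x, y in test if classify_fn(model, x) != y)
        errors.append(100.0 * wrong / len(test))
    return sum(errors) / n, errors


# Hypothetical stand-in learner: always predict the majority class.
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]


def classify_majority(model, x):
    return model


data = [(i, "b" if i % 7 == 0 else "a") for i in range(100)]
mean, per_fold = mean_generalisation_error(data, train_majority,
                                           classify_majority)
print(mean, per_fold)
```

The per-fold error percentages returned alongside the mean are the values that a significance test between two systems would compare.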

3 Three word-pronunciation 
architectures 

Our experiments are grouped in three series, each involving the
application of IGTREE to a particular word-pronunciation system. The
architectures of these systems are displayed in Figure 1. In the
following subsections, each system is introduced, an outline is given
of the experiments performed on the system, and the results are
briefly discussed.

3.1 M-A-G-Y-S 

The architecture of the M-A-G-Y-S system is inspired by SOUND1
(Hunnicutt, 1976; Hunnicutt, 1980), the word-pronunciation subsystem
of the MITalk text-to-speech system (Allen, Hunnicutt, and Klatt,
1987). When the MITalk system is faced with an unknown word, SOUND1
produces on the basis of that word a phonemic transcription with
stress markers (Allen, Hunnicutt, and Klatt, 1987). This
word-pronunciation process is divided into the following five
processing components:

1. morphological segmentation, which we implement as the module
referred to as M;

2. graphemic parsing, module A;

3. grapheme-phoneme conversion, module G;

4. syllabification, module Y;

5. stress assignment, module S.

Figure 1: Architectures of the three investigated word-pronunciation
systems. Left: M-A-G-Y-S; middle: M-G-S; right: GS. Rectangular boxes
represent modules; the letter in the box corresponds to the subtask
as listed in the legend (far right). Arrows depict data flows from
the raw input or a module, to a module or the output.

The architecture of the M-A-G-Y-S system is visualised on the left of
Figure 1. It can be seen that the input representation of a module
includes direct output from the immediately preceding module, as well
as output from earlier modules. For example, the S module takes as
input the syllabic boundaries generated by the Y module, but also the
phoneme string generated by the G module, and the morpheme boundaries
generated by the M module.

M-A-G-Y-S is put to the test by applying IGTREE in 10-fold CV
experiments to the five subtasks, connecting the modules after
training, and measuring the combined score on correctly classified
phonemes and stress markers, which is the desired output of the
word-pronunciation system. An individual module can be trained on
data from CELEX directly as input, but this method ignores the fact
that modules in a working modular system can be expected to generate
some amount of error. When one module generates an error, the
subsequent module receives this error as input, assumes it is
correct, and may generate another error. In a five-module system,
this type of cascading error may seriously hamper generalisation
accuracy. To counteract this potential disadvantage, modules can also
be trained on the output of previous modules. Modules cannot be
expected to learn to repair completely random, irregular errors, but
whenever a previous module makes consistent errors on a specific
input, this may be recognised by the subsequent module. Having
detected a consistent error, the subsequent module is then able to
repair the error and continue with successful processing. Earlier
experiments performed on the tasks investigated in this paper have
shown that classification errors on test instances are indeed
consistently and significantly decreased when modules are trained on
the output of previous modules rather than on data extracted directly
from CELEX (Van den Bosch, 1997). Therefore, we train the M-A-G-Y-S
system, with IGTREE, by training the modules of the system on the
output of preceding modules. We henceforth refer to this type of
training as adaptive training, referring to the adaptation of a
module to the errors of a preceding module.
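The difference between training a module on lexicon data and training it adaptively can be illustrated with a toy two-module cascade. Everything in this sketch is hypothetical: the lookup-table learner, the symbolic inputs, and the class names are invented to show the mechanism, and they do not correspond to the actual modules or data of the experiments.

```python
from collections import Counter, defaultdict


def train_lookup(pairs):
    """Toy learner: most frequent class per input, majority class as fallback."""
    table, overall = defaultdict(Counter), Counter()
    for x, y in pairs:
        table[x][y] += 1
        overall[y] += 1
    return ({x: c.most_common(1)[0][0] for x, c in table.items()},
            overall.most_common(1)[0][0])


def predict(model, x):
    table, default = model
    return table.get(x, default)


# Hypothetical training items: (input, gold intermediate class, gold output).
train = [("a1", "0", "X"), ("a1", "0", "X"), ("a2", "1", "Y"), ("b", "1", "Z")]

# Module 1 only sees the first character, so it errs consistently on "a2".
module1 = train_lookup([(x[0], mid) for x, mid, _ in train])

# Module 2 trained on gold intermediate classes taken from the lexicon.
gold_module2 = train_lookup([((x, mid), out) for x, mid, out in train])

# Module 2 trained adaptively, on the actual output of module 1.
adaptive_module2 = train_lookup(
    [((x, predict(module1, x[0])), out) for x, _, out in train])


def run(module2, x):
    """Cascade: module 2 receives the input plus module 1's output."""
    return predict(module2, (x, predict(module1, x[0])))


print(run(gold_module2, "a2"), run(adaptive_module2, "a2"))
```

In this toy setting the lexicon-trained second module has never seen the first module's consistent mistake and falls back on its default class, whereas the adaptively trained module has learned to map the erroneous intermediate class to the correct output.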

Figure 2 displays the results obtained with IGTREE under the adaptive
variant of M-A-G-Y-S. The figure shows the percentages (displayed
above the bars; error bars on top of the main bars indicate standard
deviations) of incorrectly classified instances for each of the five
subtasks, and a joint error on incorrectly classified phonemes with
stress markers, which is the desired output of the system. The latter
classification error, labelled PS in Figure 2, regards classification
of an instance as incorrect if either or both of the phoneme and
stress marker are incorrect. The figure shows that the joint error on
phonemes and stress markers is 10.59% of test instances, on average.
Computed in terms of transcribed words, only 35.89% of all test words
are converted to stressed phonemic transcriptions flawlessly. The
joint error is lower than the sum of the errors on the G subtask and
the S subtask, 12.95%, suggesting that about 20% of the incorrectly
classified test instances involve an incorrect classification of both
the phoneme and the stress marker.

Figure 2: Generalisation errors on the M-A-G-Y-S system in terms of
the percentage of incorrectly classified test instances by IGTREE on
the five subtasks M, A, G, Y, and S, and on phonemes and stress
markers jointly (PS).

Figure 3: Generalisation errors on the M-G-S system in terms of the
percentage of incorrectly classified test instances by IGTREE on the
three subtasks M, G, and S, and on phonemes and stress markers
jointly (PS).
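The estimate of about 20% follows from inclusion-exclusion: an instance counted in the joint error has a wrong phoneme, a wrong stress marker, or both, so the instances with both wrong are counted twice in the sum of the two separate errors. A small check with the percentages quoted for M-A-G-Y-S (the variable names are chosen for this example only):

```python
# Percentages of test instances, as quoted in the text for M-A-G-Y-S.
sum_of_g_and_s_errors = 12.95   # error on G plus error on S
joint_error = 10.59             # phoneme or stress marker (or both) wrong

# Instances with both phoneme and stress marker wrong are counted twice
# in the sum but only once in the joint error.
both_wrong = sum_of_g_and_s_errors - joint_error

# Share of the incorrectly classified instances that have both wrong.
share_of_joint_errors = 100.0 * both_wrong / joint_error
print(round(both_wrong, 2), round(share_of_joint_errors, 1))
```

The difference amounts to roughly a fifth of the joint error, in line with the figure given in the text.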

3.2 M-G-S 

The subtasks of graphemic parsing (A) and grapheme-phoneme conversion
(G) are clearly related. While A attempts to parse a letter string
into graphemes, G converts graphemes to phonemes. Although they are
performed independently in M-A-G-Y-S, they can be integrated easily
when the class-'1' instances of the A task are mapped to their
associated phoneme rather than '1', and the class-'0' instances are
mapped to a phonemic null, /-/, rather than '0' (cf. Table 1). This
task integration is also used in the NETtalk model (Sejnowski and
Rosenberg, 1987). A similar argument can be made for integrating the
syllabification and stress assignment modules into a single
stress-assignment module. Stress markers, in our definition of the
stress-assignment subtask, are placed solely on the positions which
are also marked as syllable boundaries (i.e., on syllable-initial
phonemes). Removing the syllabification subtask makes finding those
syllable boundaries which are relevant for stress assignment an
integrated part of stress assignment. Syllabification (Y) and stress
assignment (S) can thus be integrated in a single stress-assignment
module S.

When both pairs of modules are reduced to single modules, the
three-module system M-G-S is obtained. Figure 1 displays the
architecture of the M-G-S system in the middle. Experiments on this
system are performed analogously to the experiments with the
M-A-G-Y-S system; Figure 3 displays the average percentages of
generalisation errors generated by IGTREE on the three subtasks and
on phonemes and stress markers jointly (the error bar labelled PS).

Removing graphemic parsing (A) and syllabification (Y) as explicit
in-between modules yields better accuracies on the grapheme-phoneme
conversion (G) and stress assignment (S) subtasks than in the
M-A-G-Y-S system. Both differences are significant: for G,
t(19) = 43.70, p < 0.001, and for S, t(19) = 32.00, p < 0.001. The
joint accuracy on phonemes and stress markers is also significantly
better in the M-G-S system than in the M-A-G-Y-S system
(t(19) = 37.50, p < 0.001). Different from M-A-G-Y-S, the sum of the
errors on phonemes and stress markers, 8.09%, is hardly more than the
joint error on PSs, 7.86%: there is hardly any overlap in instances
with incorrectly classified phonemes and stress markers. The
percentage of flawlessly processed test words is 44.89%, which is
markedly better than the 35.89% of M-A-G-Y-S.

3.3 GS 

GS is a single-module system in which only one classification task is
performed in one pass. The GS task integrates grapheme-phoneme
conversion and stress assignment: the task is to classify letter
windows as corresponding to a phoneme with a stress marker (PS). In
the GS system, a PS can be either (i) a phoneme or a phonemic null
with stress marker '0', or (ii) a phoneme with stress marker '1'
(i.e., the first phoneme of a syllable receiving primary stress), or
(iii) a phoneme with stress marker '2' (i.e., the first phoneme of a
syllable receiving secondary stress). The simple architecture of GS,
which does not reflect any linguistic expert knowledge about
decompositions of the word-pronunciation task, is visualised as the
rightmost architecture in Figure 1. It only assumes the presence of
letters at the input, and phonemes and stress markers at the output.
Table 1 displays example instance PS classifications generated on the
basis of the word booking. The phonemes with stress markers (PSs) are
denoted by composite labels. For example, the first instance in
Table 1, _ _ _ b o o k, maps to class label /b/1, denoting a /b/
which is the first phoneme of a syllable receiving primary stress.

Figure 4: Percentage of generalisation errors made by IGTREE on the
GS task, in terms of the percentage of incorrectly classified test
instances as well as on phonemes and stress assignments computed
separately.

Figure 5: Average numbers of nodes in the decision trees generated by
IGTREE for the M-A-G-Y-S, M-G-S, and GS systems. Compartments
indicate the numbers of nodes needed for the trees of the subtasks
specified by their labels.

The experiments with GS were performed with the same word-pronunciation
data set as used with M-A-G-Y-S and M-G-S. The number of PS classes
(i.e., all combinations of phonemes and stress markers) occurring in
this data base is 159. Figure 4 displays the generalisation errors in
terms of incorrectly classified test instances. The figure also
displays the percentage of classification errors made on phonemes and
stress markers computed separately.

On the GS task, IGTREE yields significantly better generalisation
accuracy on phonemes and stress markers, both jointly and
independently, than on the modular systems. In terms of PSs, the
accuracy on GS is significantly better than that of M-G-S
(t(19) = 40.48, p < 0.001), and that of M-A-G-Y-S (t(19) = 6.90,
p < 0.001). Its accuracy on flawlessly transcribed test words,
59.38%, is also considerably better than that of the modular systems.
Compared to accuracies reported in related research on learning
English word pronunciation (Sejnowski and Rosenberg, 1987; Wolpert,
1990; Dietterich, Hild, and Bakiri, 1995; Yvon, 1996) and to general
quality demands of text-to-speech applications, an error of 3.79% on
phonemes and 30.62% on words can be considered adequate, though still
not excellent (Yvon, 1996; Van den Bosch, 1997).



4 Comparisons of M-A-G-Y-S, 
M-G-S, and GS 

We have given significance results showing that, under our
experimental conditions and using IGTREE as the learning algorithm,
the best generalisation accuracy on word pronunciation is obtained
with GS, the system that does not incorporate any explicit
decomposition of the word-pronunciation task. In this section we
perform two additional comparisons of the three systems. First, we
compare the sizes of the trees constructed by IGTREE on the three
systems; second, we analyse the positive and negative effects of
learning the subtasks in their specific systems' context.

Tree sizes 

An advantage, in terms of computational efficiency, of using fewer or
no decompositions is the smaller total amount of memory needed for
storing the trees. Although the application of IGTREE generally
results in small trees that fit well inside small computer memories
(for our modular (sub)tasks, tree sizes vary from 64,821 nodes for
the M modules to 153,678 nodes for the G module in M-A-G-Y-S,
occupying 453,747 to 1,075,746 bytes of memory), keeping five trees
in memory would not be a desirable feature for a system optimised on
memory use. Figure 5 displays the summed number of nodes for each of
the three IGTREE-trained systems under the adaptive variant. Each bar
is divided into compartments indicating the number of nodes in the
trees generated for each of the modular subtasks.

Figure 5 shows that the model with the best generalisation accuracy,
GS, is also the model taking up the smallest number of nodes. The
number of nodes in the single GS tree, 111,062, is not only smaller
than the sum of the numbers of nodes needed for the G and S modules
in the M-G-S system (204,345 nodes); it is even smaller than the
single tree constructed for the G subtask in the M-G-S system
(125,182 nodes).

A minor difference in tree size can be seen between the trees built
for the G module in the M-G-S system, 125,182 nodes, and the G module
in the M-A-G-Y-S system, 153,678 nodes. A similar difference can be
seen for the S modules, taking up 79,163 nodes in the M-G-S system,
and 96,998 nodes in the M-A-G-Y-S system. The size of the trees built
for modules appears to increase when the module is preceded by more
modules, which suggests that IGTREE is faced with a more complex
task, including potentially erroneous output from more modules, when
building a tree for a module further down a sequence of modules.
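The node counts and memory figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The numbers are taken from the text; the variable names and the reading of the byte figures as a fixed cost per node are assumptions made for this check.

```python
# Node counts and memory use quoted for two trees in M-A-G-Y-S.
nodes = {"M": 64_821, "G": 153_678}
memory_bytes = {"M": 453_747, "G": 1_075_746}
bytes_per_node = {task: memory_bytes[task] / nodes[task] for task in nodes}

# Tree sizes in the M-G-S system and in the single-module GS system.
mgs_g_nodes, mgs_s_nodes = 125_182, 79_163
gs_nodes = 111_062
mgs_g_plus_s = mgs_g_nodes + mgs_s_nodes

print(bytes_per_node)                  # bytes needed per stored node
print(mgs_g_plus_s, gs_nodes / mgs_g_plus_s)
```

Both quoted byte totals correspond to the same fixed number of bytes per node, and the single GS tree needs only a little more than half the nodes of the G and S trees of M-G-S combined.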

Utility effects 

The particular sequence of the five modules as in 
the M-A-G-Y-S system reflects a number of assump- 
tions on the utility of using output from one subtask 
as input to another subtask. Morphological knowl- 
edge is useful as input to grapheme-phoneme conver- 
sion (e.g., to avoid pronouncing ph in loophole as /f/, 
or red in barred as /red/); graphemic parsing is use- 
ful as input to grapheme-phoneme conversion (e.g., 
to avoid the pronunciation of gh in through); etc. 
Thus, feeding the output of a module A into a subse- 
quent module B implies that one expects to perform 
better on module B with A's input than without. 
The accuracy results obtained with the modules of 
the M-A-G-Y-S, M-G-S, and GS systems can serve as 
tests for their respective underlying utility assump- 
tions, when they are compared to the accuracies ob- 
tained with their subtasks learned in isolation. 
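A utility effect in this sense can be expressed as the difference between the error of a subtask learned in isolation and its error inside a system. The definition below is inferred from the values reported in Table 2 rather than stated as a formula in the text, and the function name `utility` is hypothetical.

```python
def utility(isolated_error, in_system_error):
    """Positive: the module profits from the input of earlier modules.
    Negative: it performs worse than when learned in isolation."""
    return round(isolated_error - in_system_error, 2)


# Error percentages as reported in Table 2 for the M-A-G-Y-S system.
s_ideal = utility(7.96, 2.67)    # S with flawless input from earlier modules
g_actual = utility(3.72, 7.67)   # G with actual (adaptive) modular input
a_ideal = utility(1.39, 1.66)    # A with flawless morphological input
print(s_ideal, g_actual, a_ideal)
```

A positive value, as for stress assignment with flawless input, supports the corresponding utility assumption; a negative value, as for grapheme-phoneme conversion with actual modular input, indicates that errors passed on by earlier modules outweigh whatever the extra input contributes.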

To measure the utility effects of including the out- 
puts of modules as inputs to other modules, we per- 
formed the following experiments: 

1. We applied IGTREE in 10-fold CV experiments to
each of the five subtasks M, A, G, Y, and S, only
using letters (with the M, A, G, and S subtasks)
or phonemes (with the Y and the S subtasks)
as input, and their respective classification as
output (cf. Table 1). The input is directly
extracted from CELEX. These experiments provide
the baseline score for each subtask, and
are referred to as the isolated experiments.

2. We applied IGTREE in 10-fold CV experiments
to all subtasks of the M-A-G-Y-S, M-G-S, and GS
systems, training and testing on input extracted
directly from CELEX. The results from these
experiments reflect what the accuracy of the
modular systems would be if each module performed
flawlessly. We refer to these experiments as ideal.

                  % generalisation error
sub-
task      isolated   ideal (utility)   actual (utility)
---------------------------------------------------------
M-A-G-Y-S
  M         5.14      5.14  (0.00)      5.14  (0.00)
  A         1.39      1.66 (-0.27)      1.50 (-0.11)
  G         3.72      3.68 (+0.04)      7.67 (-3.95)
  Y         0.45      0.75 (-0.30)      2.63 (-2.16)
  S         7.96      2.67 (+5.29)      5.28 (+2.68)
M-G-S
  M         5.14      5.14  (0.00)      5.14  (0.00)
  G         3.72      3.66 (+0.06)      3.99 (-0.27)
  S         7.96      3.97 (+3.99)      4.10 (+3.86)
GS
  G         3.72                        3.79 (-0.07)
  S         4.71                        3.97 (+0.74)

Table 2: Overview of utility effects of learning
subtasks (M, A, G, Y, and S) as modules or partial tasks
in the M-A-G-Y-S, M-G-S, and GS systems. For each
module, in each system, the utility of training the
module with ideal data (middle) and actual, modular
data under the adaptive variant (right), is compared
against the accuracy obtained with learning
the subtasks in isolation (left). Accuracies are given
in percentage of incorrectly classified test instances.
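
The experimental set-up listed above (windowed instances, 10-fold CV, error measured as the percentage of misclassified test instances) can be sketched as follows. This is a minimal illustration under stated assumptions: the padding symbol, the context width of three letters on either side, the index-based fold assignment, and all function names are hypothetical and are not taken from the original implementation.

```python
from typing import Iterator, List, Sequence, Tuple

PAD = "_"  # hypothetical padding symbol for positions beyond the word edge

Instance = Tuple[Tuple[str, ...], str]


def windows(word: str, labels: Sequence[str], context: int = 3) -> List[Instance]:
    """Turn one word into fixed-size letter windows, one per letter.

    Each instance pairs the focus letter and `context` neighbours on
    either side with the class label of the focus letter.
    """
    if len(word) != len(labels):
        raise ValueError("need exactly one class label per letter")
    padded = PAD * context + word + PAD * context
    width = 2 * context + 1
    return [(tuple(padded[i:i + width]), labels[i]) for i in range(len(word))]


def k_fold(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every item is tested exactly once."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def error_rate(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of incorrectly classified test instances."""
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return 100.0 * wrong / len(gold)


# Placeholder class labels; real labels would come from the lexical resource.
for features, label in windows("book", ["c1", "c2", "c3", "c4"]):
    print("".join(features), label)
```

A learner would be trained on the instances selected by each train index list and scored with error_rate on the matching test list; averaging over the ten folds gives one generalisation error per subtask.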

With the results of these experiments we measure,
for each subtask in each of the three systems,
the utility effect of including input from preceding
modules, for the ideal case (with input straight from
CELEX) as well as for the actual case (with input
from preceding modules). A utility effect is the
difference between IGTREE's generalisation error on
the subtask in isolation and its generalisation error
on the same subtask in modular context (either ideal
or actual); a positive effect means fewer errors in
modular context. Table 2 lists all computed utility
effects.
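
The computation behind the utility effects can be sketched as below. The sign convention (isolated error minus modular error, so that positive values mean fewer errors in modular context) is inferred from the signs printed in Table 2. The second quantity, cascade_cost, is an illustrative addition that is not defined in the paper: it is the gap between actual and ideal error, which under the set-up described above would be attributable to imperfect output of preceding modules. The reported percentages are rounded, so recomputed differences may deviate by a few hundredths for some rows.

```python
def utility(isolated_error: float, modular_error: float) -> float:
    """Utility effect in percentage points.

    Positive: fewer errors in modular context than in isolation.
    """
    return round(isolated_error - modular_error, 2)


def cascade_cost(ideal_error: float, actual_error: float) -> float:
    """Hypothetical measure: extra error with real rather than perfect input."""
    return round(actual_error - ideal_error, 2)


# (system, subtask): (isolated, ideal, actual) errors copied from Table 2.
table2 = {
    ("M-A-G-Y-S", "G"): (3.72, 3.68, 7.67),
    ("M-A-G-Y-S", "S"): (7.96, 2.67, 5.28),
    ("M-G-S", "G"): (3.72, 3.66, 3.99),
    ("M-G-S", "S"): (7.96, 3.97, 4.10),
}

for (system, subtask), (isolated, ideal, actual) in table2.items():
    print(
        system,
        subtask,
        "ideal utility:", utility(isolated, ideal),
        "actual utility:", utility(isolated, actual),
        "cascade cost:", cascade_cost(ideal, actual),
    )
```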

For the case of the M-A-G-Y-S system, it can
be seen that the only large utility effect, even in
the ideal case, could be obtained with the
stress-assignment subtask. In the isolated case, the
input consists of phonemes; in the M-A-G-Y-S system,
the input contains morpheme boundaries, phonemes,
and syllable boundaries. The ideal positive effect
on the S module of 5.29% fewer errors turns out
to be a positive effect of 2.68% in the actual
system. The latter positive effect is outweighed by a
rather large negative utility effect on the
grapheme-phoneme conversion task of -3.95%. Neither
the A nor the Y subtask profits from morphological
boundaries as input, even in the ideal case; in the
actual M-A-G-Y-S system, the utility effect of
including morphological boundaries from M and
phonemes from G in the syllabification module Y is
markedly negative: -2.16%.

In the M-G-S system, the utility effects are
generally less negative than in the M-A-G-Y-S system.
There is a small utility effect in the ideal case
with including morphological boundaries as input
to grapheme-phoneme conversion; in the actual
M-G-S system, the utility effect is negative (-0.27%).
The stress-assignment module benefits from including
morphological boundaries and phonemes in its
input, both in the ideal case and in the actual
M-G-S system.

The GS system does not contain separate modules,
but it is possible to compare the errors made
on phonemes and stress assignments separately to
the results obtained on the subtasks learned in
isolation. Grapheme-phoneme conversion is learned with
almost the same accuracy when learned in isolation
as when learned as a partial task of the GS task.
Learning the grapheme-phoneme task, IGTREE is neither
helped nor hampered significantly by learning stress
assignment simultaneously. There is a positive
utility effect in learning stress assignment, however.
When stress assignment is learned in isolation with
letters as input, IGTREE classifies 4.71% of test
instances incorrectly, on average. (This is a lower
error than obtained with learning stress assignment
on the basis of phonemes, indicating that stress
assignment should take letters as input rather than
phonemes.) When the stress-assignment task is learned
along with grapheme-phoneme conversion in the GS
system, a marked improvement is obtained: 0.74% fewer
classification errors are made.

Summarising, comparing the accuracies on modular
subtasks to the accuracies on their isolated
counterpart tasks shows only a few positive utility
effects in the actual system, all obtained with stress
assignment. The largest utility effect is found on the
stress-assignment subtask of M-G-S. However, this
positive utility effect does not lead to optimal
accuracy on the S subtask; in the GS system, stress
assignment is performed with letters as input,
yielding the best accuracy on stress assignment in our
investigations, viz. 3.97% incorrectly classified test
instances.

5 Related work

The classical NETTALK paper by (Sejnowski and
Rosenberg, 1987) can be seen as a primary source
of inspiration for the present study; it has been so
for a considerable amount of related work. Although
it has been criticised for being vague and
presumptuous and for presenting generalisation
accuracies that can be improved easily with other
learning methods (Stanfill and Waltz, 1986; Wolpert,
1990; Weijters, 1991; Yvon, 1996), it was the first
paper to investigate grapheme-phoneme conversion as an
interesting application for general-purpose learning
algorithms. However, few reports have been made on
the joint accuracies on stress markers and phonemes
in work on the NETTALK data. To our knowledge,
only (Shavlik, Mooney, and Towell, 1991) and
(Dietterich, Hild, and Bakiri, 1995) provide such
reports. In terms of incorrectly processed test
instances, (Shavlik, Mooney, and Towell, 1991) obtain
better performance with the back-propagation
algorithm trained on distributed output (27.7%
errors) than with the ID3 (Quinlan, 1986) decision-tree
algorithm (34.7% errors), both trained and tested
on small non-overlapping sets of about 1,000
instances. (Dietterich, Hild, and Bakiri, 1995) report
similar errors on similarly-sized training and
test sets (29.1% for BP and 34.4% for ID3); with a
larger training set of 19,003 words from the NETTALK
data and an input encoding fifteen letters, previous
phoneme and stress classifications, some
domain-specific features, and error-correcting output
codes, ID3 generates 8.6% errors on test instances
(Dietterich, Hild, and Bakiri, 1995), which does not
compare favourably to the results obtained with the
NETTALK-like GS task (a valid comparison cannot
be made; the data employed in the current study
contains considerably more instances).

An interesting counterargument against the repre- 
sentation of the word-pronunciation task using fixed- 
size windows, put forward by Yvon (Yvon, 1996), is 
that an inductive-learning approach to grapheme- 
phoneme conversion should be based on associating 
variable-length chunks of letters to variable-length 
chunks of phonemes. The chunk-based approach 
is shown to be applicable, with adequate accu- 
racy, to several corpora, including corpora of French 
word pronunciations and, as mentioned above, the
NETTALK data (Yvon, 1996). Experiments on other
(larger) corpora, comparing both approaches, would
be needed to analyse their differences empirically.



6 Discussion 

We have demonstrated that a decision-tree learning
algorithm, IGTREE, is able to learn English word
pronunciation with modest to adequate generalisation
accuracy: the less the learning task is decomposed into
subtasks, the more adequate the generalisation
accuracy obtained by IGTREE is. The best generalisation
accuracy is obtained with the GS system, which does
not decompose the task at all. The general
disadvantage of the investigated modular systems is that
modules do not perform their tasks flawlessly, while
their expert-based decompositions do assume flawless
performance. In practice, modules produce a
considerable number of irregular errors which cause
subsequent modules to generate subsequent 'cascading'
errors. Only the subtask of stress assignment is
shown to be learned more successfully on the basis
of modular input.

The best-performing system, GS, is trained to map
windows of letters to combined class labels
representing phonemes and stress markers. Compared
to the M-A-G-Y-S and M-G-S systems, the GS system
(i) lacks an explicit morphological segmentation
and (ii) learns stress assignment jointly with
grapheme-phoneme conversion on the basis of letter
windows rather than phoneme windows. These
two advantageous properties of the GS system lead
to three suggestions. First, it appears better to leave
morphological segmentation an implicit subtask; it
can be left to the learning algorithm to extract the
necessary morphological information needed to
disambiguate between alternative pronunciations
directly from the letter-window input. Second,
letter-window instances provide the most reliable source
of input for both grapheme-phoneme conversion and
stress assignment. Third, stress assignment and
grapheme-phoneme conversion can be integrated in
one task, i.e., to map letter instances to 'stressed
phonemes'.

A warning on the scope of these suggestions needs
to be issued. The results described here are not
only dependent on the resource (CELEX) and the
(sub)task definitions (classification of windowed
instances), but also on the use of IGTREE as the
learning algorithm. The CELEX data appears robust and
provides an abundance of English word
pronunciations, not an inappropriately skewed subset of
the English vocabulary. The windowing method appears
a salient method to rephrase language tasks as
classification tasks based on fixed-length inputs. It is
not clear, however, to what extent IGTREE can be
held responsible for the low accuracy on M-A-G-Y-S
and M-G-S; IGTREE may be negatively sensitive
in terms of generalisation accuracy to irregular
errors in the input of a modular subtask. Although
irregular errors are an inherent problem for modular
systems, other learning algorithms may be able
to handle such errors differently. Experiments with
back-propagation learning applied to the same modular
systems show significantly worse performance
than that of IGTREE (Van den Bosch, 1997). It
might be possible that instance-based learning
algorithms (e.g., IB1-IG (Daelemans and Van den Bosch,
1992; Daelemans, Van den Bosch, and Weijters,
1997)), which have been demonstrated to outperform
IGTREE on several language tasks (Daelemans,
Gillis, and Durieux, 1994; Van den Bosch, Daelemans,
and Weijters, 1996; Van den Bosch, 1997),
perform better on the modular systems. Although
such systems trained with IB1-IG would be
computationally rather inefficient (Van den Bosch, 1997),
employing IB1-IG in learning modular subtasks may
lead to other differences in accuracy between modular
systems.

A conclusion to be drawn from our study is that
it is possible to learn the complex language task of
English word pronunciation with a general-purpose
inductive-learning algorithm, with an adequate level
of generalisation accuracy. The results suggest that
the necessity of decomposing word pronunciation
into several subtasks should be reconsidered
carefully when designing an accuracy-oriented
word-pronunciation system. Undesired errors generated
by sequenced modules may easily outweigh the desired
positive utility effects.